Classifying Music Genres Using Spotify Audio Features

Summary

In this project, we aim to answer the question of whether Spotify’s audio features can be used to predict song genres. We build a k-nearest neighbours (KNN) classification model, evaluate it using accuracy estimates and a confusion matrix, and produce visualizations and statistics to summarize our findings.

Introduction

Music recommendation systems have experienced rapid growth in the past decade, both in terms of the increasingly influential role they play in media consumption and the research invested in improving these algorithms. Despite the sophisticated techniques and ample data available today, reports of music genre classification show a wide variety of results, in part due to the plethora of possible features on which a song can be classified (Singh and Biswas, 2022).

Projects involving deep learning have shown a wide variety of results with often limited success (Pelchat and Gelowitz, 2020), while traditional machine learning classification based on musical characteristics (such as tempo, pitch, and chord progression) seems to be relatively accurate (Ndou et al., 2021). However, these approaches to classification are distinctly different from that of Spotify, one of the most prevalent streaming services today, which boasts an impressive recommendation algorithm. Rather than conforming to genre classification, Spotify’s algorithm emphasizes personalized recommendations for each user, introducing a degree of bias that complicates the problem even further. Interestingly, Spotify has been so proficient at tailoring its algorithm to users’ listening habits that it faces critiques for decreasing the exposure and discoverability of diverse music genres (Snickars, 2017). Although Spotify has released its database of songs for public use via a Web API, and even made its custom audio features available (e.g., speechiness, liveness), there is still relatively little detail on how these features are used in its algorithm.

Because Spotify’s audio features differ so much from those commonly selected for classification rooted in music theory, yet are so central to music personalization, we wonder how these features may perform on the closely related task of genre classification.

Our goal is to discover how well Spotify’s custom audio features are able to predict common genres of music.

# Load necessary packages 
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.4.1     ✔ purrr   0.3.4
## ✔ tibble  3.2.0     ✔ dplyr   1.0.8
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(repr)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.2.0 ──
## ✔ broom        0.8.0     ✔ rsample      0.1.1
## ✔ dials        0.1.1     ✔ tune         0.2.0
## ✔ infer        1.0.0     ✔ workflows    0.2.6
## ✔ modeldata    0.1.1     ✔ workflowsets 0.2.1
## ✔ parsnip      0.2.1     ✔ yardstick    0.0.9
## ✔ recipes      0.2.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(RCurl)
## 
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
## 
##     complete
library(kknn)
library(cowplot)
library(testthat)
## 
## Attaching package: 'testthat'
## The following object is masked from 'package:rsample':
## 
##     matches
## The following object is masked from 'package:dplyr':
## 
##     matches
## The following object is masked from 'package:purrr':
## 
##     is_null
## The following objects are masked from 'package:readr':
## 
##     edition_get, local_edition
## The following object is masked from 'package:tidyr':
## 
##     matches
library(here)
## here() starts at /home/rstudio/workspace
# Set default rows displayed for dataframes
options(repr.matrix.max.rows = 6)

# Set seed for reproducibility 
set.seed(1)

The Dataset

We will be using the Spotify Songs dataset obtained here: https://github.com/dliu0049/tidytuesday_wc/tree/master/data/2020/2020-01-21.

# Read the data from the web into jupyter
url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv"
x <- getURL(url)
song_data <- read.csv(text = x)
# Ran once to write the data into a local folder, don't need to run again
# write.csv(song_data,"data/spotify_songs.csv")
# Overview of the original dataset
song_data

Table 1.1 - Original dataset

Preliminary Data Analysis

To tidy the data, we first selected from song_data only the target variable playlist_genre and the predictor features that relate to the acoustic aspects of the songs. Then, the datatype of playlist_genre was converted from character to factor, and the resulting dataframe was named tidy_song_data.

# load in the function
source(here("R/classy_read.R"))

# Tidy the data, only keep the relevant columns
# Change datatype of playlist_genre from character to factor in order for some of the functions to work later
# Check number of missing values in each column of the table in the training data

tidy_song_data <- classy_read(url, "playlist_genre", playlist_genre, danceability:tempo)
##   playlist_genre danceability energy key loudness mode speechiness acousticness
## 1              0            0      0   0        0    0           0            0
##   instrumentalness liveness valence tempo
## 1                0        0       0     0
head(tidy_song_data)

Table 1.2 - Tidy dataset

There is no missing data in the dataset.
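The sourced helper classy_read lives in R/classy_read.R and is not reproduced in this report. Below is a minimal base-R sketch of its assumed behavior (read the CSV, keep the target column plus the acoustic feature columns, convert the target to a factor, and print the per-column NA counts); the real function takes tidyselect-style arguments, so its signature differs slightly:

```r
# Hypothetical sketch of classy_read; the real helper is in R/classy_read.R.
# Assumed behavior: read, subset, factor-convert the target, report NA counts.
classy_read <- function(url, target, feature_cols) {
  data <- read.csv(url)[, c(target, feature_cols)]
  data[[target]] <- as.factor(data[[target]])
  # Report the number of missing values in each column, as in the output above
  print(as.data.frame(t(colSums(is.na(data)))))
  data
}
```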

Next we separated tidy_song_data into a training_song_data set with which to build our classifier, and a testing_song_data set with which to evaluate it later on. The training/testing split ratio is 75:25, and the relative proportion of each playlist_genre category was preserved in each set.

# Split the data into training and testing sets at 75:25 ratio
set.seed(1) # Set the seed for reproducibility

split_song_data <- initial_split(tidy_song_data, prop = 0.75, strata = playlist_genre)

# training set
training_song_data <- training(split_song_data)

# testing set
testing_song_data <- testing(split_song_data)

For preliminary data analysis, two things were checked using only the training_song_data: first, the proportions of the playlist_genre categories, to make sure the split divided them properly; and second, the number of NA values in the dataset, which is important to ensure the set is suitable for analysis.

# Preliminary data analysis

# Get proportions of genres from data using a function
# load in the function
source(here("R/count_proportion.r"))

# Get proportions of genres from tidy data
tidy_prop <- count_proportion(tidy_song_data, 'playlist_genre', "tidy")

# Get proportions of genres from training data
train_prop <- count_proportion(training_song_data, 'playlist_genre', "train")

# Combine dataframes to compare
df_list <- list(train_prop, tidy_prop)      

prop_df <- df_list %>% reduce(full_join, by='playlist_genre')
prop_df[,c(1,2,4,3,5)]

Table 1.3.1 - Proportions of each genre in training set compared to tidy data

The proportions are about the same in both datasets.
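The helper count_proportion is defined in R/count_proportion.r and is not shown here. A base-R sketch of its assumed behavior, inferred from the column layout of Table 1.3.1 (the label argument is assumed to prefix the count and proportion column names, so the full join by playlist_genre yields one count/proportion pair per dataset):

```r
# Hypothetical sketch of count_proportion; the real helper is in
# R/count_proportion.r. Assumed to return one row per category with the
# category name, its count, and its proportion of the whole dataset.
count_proportion <- function(data, col, label) {
  counts <- table(data[[col]])
  out <- data.frame(names(counts),
                    as.vector(counts),
                    as.vector(counts) / nrow(data))
  names(out) <- c(col, paste0(label, "_count"), paste0(label, "_proportion"))
  out
}
```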

# Check number of missing values in each column of the table in the training data
num_na <- training_song_data|> 
            summarize_all(~sum(is.na(.))) 
num_na

Preliminary Data Visualizations

For the preliminary data visualization, histograms were created comparing the audio features between each of the playlist_genre categories.

NOTE: This cell may take around 20 seconds to run

# Preliminary data visualization
# Histograms of each of the features that we are using, differentiated by labeled genre
options(repr.plot.width = 15, repr.plot.height = 15)

danceability_hist <- ggplot(training_song_data, aes(x = danceability)) +
                            geom_histogram(bins=20) +
                            facet_grid(rows = "playlist_genre")

energy_hist <- ggplot(training_song_data, aes(x = energy)) +
                            geom_histogram(bins=20) +
                            facet_grid(rows = "playlist_genre")

key_hist <- ggplot(training_song_data, aes(x = key)) +
                            geom_histogram(bins=20) +
                            facet_grid(rows = "playlist_genre")

loudness_hist <- ggplot(training_song_data, aes(x = loudness)) +
                            geom_histogram(bins=20) +
                            facet_grid(rows = "playlist_genre")

mode_hist <- ggplot(training_song_data, aes(x = mode)) +
                            geom_histogram(bins=20) +
                            facet_grid(rows = "playlist_genre")

speechiness_hist <- ggplot(training_song_data, aes(x = speechiness)) +
                            geom_histogram(bins=20) +
                            facet_grid(rows = "playlist_genre")

acousticness_hist <- ggplot(training_song_data, aes(x = acousticness)) +
                            geom_histogram(bins=20) +
                            facet_grid(rows = "playlist_genre")

instrumentalness_hist <- ggplot(training_song_data, aes(x = instrumentalness)) +
                            geom_histogram(bins=20) +
                            facet_grid(rows = "playlist_genre")

liveness_hist <- ggplot(training_song_data, aes(x = liveness)) +
                            geom_histogram(bins=20) +
                            facet_grid(rows = "playlist_genre")

valence_hist <- ggplot(training_song_data, aes(x = valence)) +
                            geom_histogram(bins=20) +
                            facet_grid(rows = "playlist_genre")

tempo_hist <- ggplot(training_song_data, aes(x = tempo)) +
                            geom_histogram(bins=20) +
                            facet_grid(rows = "playlist_genre")

# Plot all the histograms together 
plot_grid(danceability_hist, energy_hist, key_hist, loudness_hist, mode_hist, speechiness_hist, acousticness_hist, instrumentalness_hist, liveness_hist, valence_hist, tempo_hist, ncol = 4, labels = "AUTO")

Figure 1.4 - Preliminary data visualization

The histograms above show that while the distributions of certain audio features are similar between genres, every feature exhibits some difference in central tendency across genres. These differences may provide enough information for the classifier to achieve a reasonable accuracy.

Methods and Analysis

The following steps show how we build the classifier:

First, we scale and center the predictors so that each variable contributes equally to the distance calculation. Then we create a recipe with playlist_genre as the target variable, using all the training data, and set up tuning to find the best k value.

# Scale predictors, use standard recipe, setup knn_spec to tune for best k value
song_recipe <- recipe(playlist_genre ~ ., data = training_song_data) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", 
                             neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

Next, we conduct 5-fold cross validation in order to find the most suitable hyperparameter by getting better estimates of the accuracy of each k value.

NOTE: The cell below will take around 5 minutes to load due to the size of the dataset

# Cross validating and finding the best hyperparameter for the model

set.seed(1) # Setting seed for reproducibility

# Try k values from 1 to 51, counting by 10 (6 candidate values)
k_vals <- tibble(neighbors = seq(from = 1, to = 51, by = 10))

song_vfold <- vfold_cv(training_song_data, v = 5, strata = playlist_genre)

knn_results <- workflow() |>
  add_recipe(song_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = song_vfold, grid = k_vals) |>
  collect_metrics() 

accuracies <- knn_results |>
  filter(.metric == "accuracy")

The following table and plot summarize these results:

# Accuracy table for different k values
accuracies |>  arrange(desc(mean))

Table 2.1 - Accuracy of the different k values

# Plot the different accuracies of k, 
accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "Accuracy Estimate") + 
  theme(text = element_text(size = 15))

NOTE: The plot below may not render correctly in the GitHub preview, but it displays correctly when run locally

options(repr.plot.width = 5, repr.plot.height = 5)
accuracy_vs_k

Figure 2.2 - Accuracies of the different k values

As seen in the plot above, the k value that gives the best accuracy before diminishing returns is 11, with an estimated accuracy of ~47.3%.

NOTE: The code below will also take a while to run

# Calculate accuracy of the model using the best k and cross-validation

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 11) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(song_recipe) |>
  add_model(knn_spec) |>
  fit_resamples(resamples = song_vfold)

accuracy_value <- knn_fit |> 
  collect_metrics() 
# Display accuracy of the model
accuracy_value

Table 2.3 - Accuracy of the model on validation data

Finally, after seeing our model perform on the validation data with an accuracy of approximately 0.473, we then fit our model with the optimized k value on the testing set to see how it performs on new data.

#test predictions using test-data
knn_fit <- workflow() |>
  add_recipe(song_recipe) |>
  add_model(knn_spec) |>
  fit(training_song_data)

song_test_predictions <- predict(knn_fit, testing_song_data) |>
  bind_cols(testing_song_data)

accuracy_only <- song_test_predictions |>
  metrics(truth = playlist_genre, estimate = .pred_class) |>
  filter(.metric == "accuracy")


confusion <- song_test_predictions |>
             conf_mat(truth = playlist_genre, estimate = .pred_class)
# Accuracy of the model on testing data
accuracy_only

Table 2.4 - Accuracy of model on testing data

On the testing data, our model produces an accuracy of approximately 0.468, which is only slightly lower than what we saw on the validation data. Some of its predictions are shown in the table below:

# A table of predictions of the model
song_test_predictions

Table 2.5 - Predictions on the data

# Confusion
confusion
##           Truth
## Prediction edm latin pop r&b rap rock
##      edm   952   146 277  74 122  111
##      latin 105   475 181 170 184   49
##      pop   215   221 391 203 100  139
##      r&b    60   178 190 483 207  126
##      rap    81   178 114 296 771   39
##      rock   98    91 224 132  53  774

Table 2.6 - Confusion Matrix mapping out song’s actual genres to their classifier predicted genres.

# Data visualization 

matrix_plot <- autoplot(confusion, type = "mosaic") +
  aes(fill = rep(colnames(confusion$table), ncol(confusion$table))) +
  labs(fill = "Predicted")
matrix_plot

Figure 2.7 - Respective proportions of predictions in each genre for each of the genres.

We can see that for each genre, the classifier predicted the correct playlist_genre more commonly than any other single category, although in some cases the correct predictions did not form a majority. The visualization also shows which genres were most or least likely to be mistaken for each other; e.g., rock and rap were rarely predicted for each other.
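These observations can be checked directly from the counts printed in Table 2.6. The snippet below (base R only, re-entering the confusion matrix by hand with rows as predictions and columns as truth) computes per-genre recall, i.e. the fraction of each genre's songs that were classified correctly:

```r
# Confusion matrix from Table 2.6: rows = predicted genre, columns = true genre
cm <- matrix(c(952, 146, 277,  74, 122, 111,
               105, 475, 181, 170, 184,  49,
               215, 221, 391, 203, 100, 139,
                60, 178, 190, 483, 207, 126,
                81, 178, 114, 296, 771,  39,
                98,  91, 224, 132,  53, 774),
             nrow = 6, byrow = TRUE,
             dimnames = list(Prediction = c("edm", "latin", "pop", "r&b", "rap", "rock"),
                             Truth      = c("edm", "latin", "pop", "r&b", "rap", "rock")))

# Per-genre recall: correct predictions divided by the true count of that genre
recall <- diag(cm) / colSums(cm)
round(recall, 3)
```

For every column the largest count sits on the diagonal (the correct genre is the most common prediction), yet recall for genres such as latin and pop falls below 50%, matching the mosaic plot's picture of frequent cross-genre confusion.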

Discussion

From fully conducting our analysis, we were able to create a KNN Classification model that produced an accuracy of approximately 46.8% when predicting the genre of songs based on Spotify audio features in our testing data.

This was within the realm of our expectations: given the existing literature on this topic, genre classification is known to produce a wide variety of results depending on the variables selected (Singh and Biswas, 2022). Furthermore, it is unclear whether Spotify’s audio features were generated for the purpose of typical genre classification, so we could only guess at how this model would perform. In fact, the aim of this project was to get a better idea of how these features behave, and we can say we now have a better grasp of that. Despite our lower overall accuracy (46.8%) on the testing data, it is worth noting that this result is only slightly below the validation accuracy of 47.3%. This could suggest that our data preprocessing and cross-validation were employed appropriately, minimizing the effects of class imbalance and over- or underfitting of the training data, and yielding a reasonably robust model.

Our findings serve to add to the research on song genre classification and the effects of using different features. It is interesting that our accuracy was relatively low, given how well regarded Spotify’s own algorithm is while using these exact features. This could suggest that Spotify does not use these features for broad genre classification but rather for the sub-genres seen in the original dataset, or that the company is focused on personalized music recommendations for users rather than genre classification at all.

These ideas could lead to a future project on how well Spotify’s audio features classify user listening habits (e.g., predicting whether or not a song is in someone’s “Liked Songs”). Of course, we chose the rather simple KNN classification model, so such research would help us discern whether the Spotify features are indeed used mainly for recommending music, or whether our model was simply too naive. On a broader scope, we hope our findings encourage research on whether genre classification using the more human-defined features that Spotify provides (e.g., liveness, valence), as opposed to more objective musical features, is worth further exploration.

References

Ndou, Ndiatenda, et al. “Music Genre Classification: A Review of Deep-Learning and Traditional Machine-Learning Approaches.” 2021 IEEE International IOT, Electronics and Mechatronics Conference (IEMTRONICS), 2021, https://doi.org/10.1109/iemtronics52119.2021.9422487.

Pelchat, Nikki, and Craig M. Gelowitz. “Neural Network Music Genre Classification.” Canadian Journal of Electrical and Computer Engineering, vol. 43, no. 3, 11 Aug. 2020, pp. 170–173., https://doi.org/10.1109/cjece.2020.2970144.

Singh, Yeshwant, and Anupam Biswas. “Robustness of Musical Features on Deep Learning Models for Music Genre Classification.” Expert Systems with Applications, vol. 199, 2022, p. 116879., https://doi.org/10.1016/j.eswa.2022.116879.

Snickars, Pelle. “More of the Same – On Spotify Radio.” Culture Unbound, vol. 9, no. 2, 2017, https://doi.org/10.3384/cu.2000.1525.1792.